Research on Tibetan Automatic Word Segmentation

Author

  • Sun Yuan
Abstract

This paper studies Tibetan automatic word segmentation, focusing on three key techniques. (1) A Tibetan automatic word segmentation approach is proposed that takes advantage of case-auxiliary words and continuous features. (2) A method for resolving overlapping ambiguity in Tibetan word segmentation is proposed, based on a forward-backward scanning identification method and an improved maximum probability algorithm. (3) Based on an analysis of the characteristics of Tibetan names, an automatic multi-feature recognition method for Tibetan names is proposed; it uses the internal, contextual, and boundary features of names and builds a dictionary and feature base of Tibetan names. Finally, an experiment is conducted, and the results show that the methods are effective.
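The abstract does not give the details of the forward-backward scanning step or of the improved maximum probability algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: forward and backward maximum matching stand in for the two scans, a disagreement between them is treated as a sign of overlapping ambiguity, and a plain unigram probability score (not the paper's improved variant) picks between the two candidates. The syllables, lexicon, probabilities, and all function names are hypothetical placeholders, not data or code from the paper.

```python
import math
from typing import Dict, List, Set, Tuple

Word = Tuple[str, ...]  # a word is a tuple of syllables


def forward_max_match(syllables: List[str], lexicon: Set[Word], max_len: int = 4) -> List[Word]:
    """Scan left to right, always taking the longest lexicon word."""
    words: List[Word] = []
    i = 0
    while i < len(syllables):
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + size])
            # A single syllable is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(syllables: List[str], lexicon: Set[Word], max_len: int = 4) -> List[Word]:
    """Scan right to left, always taking the longest lexicon word."""
    words: List[Word] = []
    j = len(syllables)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = tuple(syllables[j - size:j])
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


def log_score(words: List[Word], probs: Dict[Word, float], floor: float = 1e-6) -> float:
    """Sum of log unigram probabilities; unseen words get a small floor value."""
    return sum(math.log(probs.get(w, floor)) for w in words)


def segment(syllables: List[str], lexicon: Set[Word], probs: Dict[Word, float]) -> List[Word]:
    """Return the agreed segmentation, or the more probable one on disagreement."""
    fwd = forward_max_match(syllables, lexicon)
    bwd = backward_max_match(syllables, lexicon)
    if fwd == bwd:
        return fwd  # no overlapping ambiguity detected by the two scans
    return max((fwd, bwd), key=lambda seg: log_score(seg, probs))


# Placeholder syllables "a", "b", "c": both "a b" and "b c" are lexicon words,
# so the middle syllable can attach either way (an overlapping ambiguity).
toy_lexicon: Set[Word] = {("a", "b"), ("b", "c")}
toy_probs: Dict[Word, float] = {("a", "b"): 0.2, ("c",): 0.1, ("a",): 0.05, ("b", "c"): 0.3}
result = segment(["a", "b", "c"], toy_lexicon, toy_probs)
```

In this toy case the forward scan yields "a b" + "c" and the backward scan yields "a" + "b c"; the unigram score then selects between them. A real system would operate on Tibetan syllables split at the tsheg mark and would need the case-auxiliary and name-recognition features the abstract mentions, which this sketch omits.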

Similar resources

Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation

In Tibetan, words are written consecutively without delimiters, so finding the boundaries of unknown words is difficult. This paper presents a hybrid approach to Tibetan unknown word identification for offline corpus processing. First, Tibetan named entities are preprocessed based on natural annotation. Second, other Tibetan unknown words are extracted from word segmentation fragments using MTC, the co...


Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation

Tibetan word segmentation is essential for Tibetan information processing. At present, Tibetan words are mainly segmented with a basic dictionary-based machine matching method, because no segmented Tibetan corpus is available for training. However, the dictionary-based method is not suited to Tibetan number identification. This paper studies...


CASICT Tibetan Word Segmentation System for MLWS2017

We participated in the MLWS 2017 Tibetan word segmentation task. Our system is trained in an unrestricted way, by introducing a baseline system and 76w of our own Tibetan segmented sentences. In the system, the character sequence is first processed by the baseline system into a word sequence, then a subword unit (BPE algorithm) splits rare words into subwords with their corresponding features, after that a neu...


Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field

In this paper, we propose a novel approach to Tibetan word segmentation using a conditional random field. We reformulate segmentation as a syllable tagging problem. The approach labels each syllable with a word-internal position tag and combines syllables into words according to their tags. As there is no publicly available Tibetan word segmentation corpus, the training corpus is gener...


Word Boundary Information and Chinese Word Segmentation

Chinese word segmentation could be considered as a problem of word boundary recognition. Word boundary information plays a significant role in human language acquisition and automatic segmentation for Natural Language Processing (NLP). Extraction of word boundary information involves cognitive psychology, computational linguistics, and language education. Methods utilizing word boundary informa...




Publication date: 2013